layout feature
Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration
Zhang, Zhenyu, Yu, Bowen, Yu, Haiyang, Liu, Tingwen, Fu, Cheng, Li, Jingyang, Tang, Chengguang, Sun, Jian, Li, Yongbin
Building document-grounded dialogue systems has received growing interest, because documents convey a wealth of human knowledge and are common in enterprises. How to comprehend and retrieve information from documents remains a challenging research problem. Previous work ignores the visual properties of documents and treats them as plain text, so the layout modality is lost. In this paper, we propose a Layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents (VRDs), so as to generate accurate responses in dialogue systems. LIE contains 62k annotations for three extraction tasks, drawn from 4,061 pages of product and official documents, making it the largest VRD-based information extraction dataset to the best of our knowledge. We also develop benchmark methods that extend the token-based language model to take layout features into account, as human readers do. Empirical results show that layout is critical for VRD-based extraction, and a system demonstration verifies that the extracted knowledge helps locate the answers that users care about.
Deep Neural Networks Evolve Human-like Attention Distribution during Reading Comprehension
Attention is a key mechanism for information selection in both biological brains and many state-of-the-art deep neural networks (DNNs). Here, we investigate whether humans and DNNs allocate attention in comparable ways when reading a text passage to subsequently answer a specific question. We analyze 3 transformer-based DNNs that reach human-level performance when trained to perform the reading comprehension task. We find that the DNN attention distribution quantitatively resembles human attention distribution measured by fixation times. Human readers fixate longer on words that are more relevant to the question-answering task, demonstrating that attention is modulated by top-down reading goals, on top of lower-level visual and text features of the stimulus. Further analyses reveal that the attention weights in DNNs are also influenced by both top-down reading goals and lower-level stimulus features, with the shallow layers more strongly influenced by lower-level text features and the deep layers attending more to task-relevant words. Additionally, deep layers' attention to task-relevant words gradually emerges when pre-trained DNN models are fine-tuned to perform the reading comprehension task, which coincides with the improvement in task performance. These results demonstrate that DNNs can evolve human-like attention distribution through task optimization, which suggests that human attention during goal-directed reading comprehension is a consequence of task optimization.
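One way to make the comparison between model attention and human fixation times concrete is a rank correlation over the words of a passage. The sketch below is illustrative only: the abstract does not say which statistic the authors used, so Spearman's rho is an assumption here, and the per-word attention weights and fixation times are made-up toy values.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): one attention weight and one fixation time per word.
attention_weights = [0.05, 0.40, 0.10, 0.45]
fixation_times_ms = [120, 310, 150, 290]
rho = spearman(attention_weights, fixation_times_ms)
```

A rho near 1 would mean that the words receiving the most model attention are also the words readers fixate longest; the same function could be applied layer by layer to compare shallow and deep layers.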
GroupLink: An End-to-end Multitask Method for Word Grouping and Relation Extraction in Form Understanding
Wang, Zilong, Zhan, Mingjie, Ren, Houxing, Hou, Zhaohui, Wu, Yuwei, Zhang, Xingyan, Liang, Ding
Forms are a common type of document in real life and carry rich information through both their textual content and their organizational structure. To process forms automatically, word grouping and relation extraction are two fundamental steps that follow preliminary optical character recognition (OCR). Word grouping aggregates words that belong to the same semantic entity, and relation extraction predicts the links between semantic entities. Existing work treats them as two separate tasks, but the two are correlated and can reinforce each other: the grouping process refines the integrated representation of the corresponding entity, and the linking process gives feedback to the grouping. We therefore acquire multimodal features from both textual data and layout information and build an end-to-end model that combines word grouping and relation extraction through multitask training, improving performance on each task. We validate the proposed method on FUNSD, a real-world, fully annotated benchmark of noisy scanned forms, and extensive experiments demonstrate its effectiveness.
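To illustrate what the two tasks produce, the sketch below turns pairwise "same entity" decisions into word groups with a union-find structure and merges the word boxes of each group into one entity box. It is not the GroupLink model: the pairwise decisions, the box format (x0, y0, x1, y1), and the function names are all assumptions made for illustration, and in the paper such decisions would come from the trained network.

```python
def group_words(num_words, same_entity_pairs):
    """Group word indices into entities from pairwise 'same entity' links."""
    parent = list(range(num_words))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in same_entity_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root index under the smaller one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for i in range(num_words):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def bounding_box(boxes):
    """Smallest box (x0, y0, x1, y1) covering all the given word boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Toy OCR output (hypothetical): four words and their boxes on one form line.
words = ["Date", "of", "birth:", "12/03/1990"]
word_boxes = [(10, 10, 40, 20), (45, 10, 55, 20), (60, 10, 95, 20), (120, 10, 190, 20)]

# Assumed grouping decisions: words 0-1 and 1-2 belong to the same entity.
entities = group_words(len(words), [(0, 1), (1, 2)])
entity_boxes = [bounding_box([word_boxes[i] for i in group]) for group in entities]

# Relation extraction then links entities, e.g. question entity 0 to answer entity 1.
entity_links = [(0, 1)]
```

Here the first three words form the question entity "Date of birth:" and the last word forms the answer entity, so a single link between entity 0 and entity 1 captures the key-value relation.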
Aesthetic Visual Quality Evaluation of Chinese Handwritings
Sun, Rongju (Peking University) | Lian, Zhouhui (Peking University) | Tang, Yingmin (Peking University) | Xiao, Jianguo (Peking University)
Aesthetic evaluation of Chinese calligraphy is one of the most challenging tasks in Artificial Intelligence. This paper attempts to solve this problem by proposing a number of aesthetic feature representations and feeding them into Artificial Neural Networks. Specifically, 22 global shape features are presented to describe a given handwritten Chinese character from different aspects according to classical calligraphic rules, and a new 10-dimensional feature vector is introduced to represent the component layout information using sparse coding. Moreover, a Chinese Handwriting Aesthetic Evaluation Database (CHAED) is built by collecting 1000 Chinese handwriting images of diverse aesthetic quality and inviting 33 subjects to evaluate the aesthetic quality of each image. Finally, back propagation neural networks are constructed with the concatenation of the proposed features as input and trained on CHAED for the aesthetic evaluation of Chinese calligraphy. Experimental results demonstrate that the proposed AI system achieves performance comparable to human evaluation. Through our experiments, we also compare the importance of each individual feature and reveal the relationship between our aesthetic features and the aesthetic perceptions of human beings.
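The input construction described above can be sketched as follows: the 22 global shape features and the 10-dimensional layout vector are concatenated into a 32-dimensional input and passed through a small feed-forward network. Only the two feature sizes come from the abstract; the hidden-layer size, the sigmoid activation, the random weights, and all function names are assumptions, and the sketch shows a forward pass only, not the back propagation training.

```python
import math
import random

NUM_SHAPE_FEATURES = 22   # global shape features (from the abstract)
NUM_LAYOUT_FEATURES = 10  # component layout vector (from the abstract)
HIDDEN_UNITS = 8          # assumption: the abstract does not give the network size


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """One weight row per output unit; the last entry of each row is the bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward_layer(weights, inputs):
    """Apply a fully connected sigmoid layer to the input vector."""
    return [
        sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
        for row in weights
    ]


def predict_score(shape_features, layout_features, hidden, output):
    """Concatenate both feature groups and return a score in (0, 1)."""
    if len(shape_features) != NUM_SHAPE_FEATURES:
        raise ValueError("expected 22 global shape features")
    if len(layout_features) != NUM_LAYOUT_FEATURES:
        raise ValueError("expected a 10-dimensional layout vector")
    x = list(shape_features) + list(layout_features)  # 32-dimensional input
    return forward_layer(output, forward_layer(hidden, x))[0]


rng = random.Random(0)
hidden = init_layer(NUM_SHAPE_FEATURES + NUM_LAYOUT_FEATURES, HIDDEN_UNITS, rng)
output = init_layer(HIDDEN_UNITS, 1, rng)

# Placeholder feature values; real ones would be computed from a character image.
score = predict_score([0.5] * 22, [0.1] * 10, hidden, output)
```

Training would adjust `hidden` and `output` by back propagation so that the score matches the human ratings collected in the database.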